Background:
The dataset we’re working on was taken from the tidytuesday repository
It provides basically transcripts from the American sitcom f.r.i.e.n.d.s, in addition to other information.
In order to work with this dataset we’ll need to install the friends R Package.
Goals:
Test the assumption that women talk more than men per episode. in order to do that we’re willing on conducting a t-test.
Find the relationship between IMDB rating for each episode and it’s views in the USA. in order to do that we’re willing to perform a linear regression.
Let’s get started!
#install.packages("friends")
suppressWarnings(suppressMessages(library(tidyverse)))
suppressWarnings(suppressMessages(library(friends)))
the friends library contains multiple datasets. we’ll be
working with two of them in order to perform the test were mentioned
earlier.
the first dataset is friends_info which contains data about
each episodes information. Basically we’ll focus on
us_views_millions and imdb_rating. here’s a
glimpse to the dataset:
glimpse(friends_info)
## Rows: 236
## Columns: 8
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ episode <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ title <chr> "The Pilot", "The One with the Sonogram at the End",~
## $ directed_by <chr> "James Burrows", "James Burrows", "James Burrows", "~
## $ written_by <chr> "David Crane & Marta Kauffman", "David Crane & Marta~
## $ air_date <date> 1994-09-22, 1994-09-29, 1994-10-06, 1994-10-13, 199~
## $ us_views_millions <dbl> 21.5, 20.2, 19.5, 19.7, 18.6, 18.2, 23.5, 21.1, 23.1~
## $ imdb_rating <dbl> 8.3, 8.1, 8.2, 8.1, 8.5, 8.1, 9.0, 8.1, 8.2, 8.1, 8.~
the second dataset is friends dataset. here’s a glimpse
of it:
glimpse(friends)
## Rows: 67,373
## Columns: 6
## $ text <chr> "There's nothing to tell! He's just some guy I work with!", ~
## $ speaker <chr> "Monica Geller", "Joey Tribbiani", "Chandler Bing", "Phoebe ~
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ scene <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ utterance <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
We’ve some missing data in this dataset which we need to perform our
t-test on words spoken by each character (Female or Male) per
episode.
In order to mine the data we looked first of all for the main and top
minor characters in the show. we chose top 38 characters (including the
6 main characters) who had the most lines in the show. The reason we had
to do that is due to lack of data about the gender of each character in
the dataset. we had these information from The
ceros interactive article in addition to the ordered data we
obtained as who talks the most:
talks_the_most <- friends %>%
mutate(num_of_words = str_count(text, "\\S+")) %>%
select(speaker, season, episode, scene, num_of_words) %>%
group_by(speaker) %>%
summarise(num_of_words = sum(num_of_words)) %>%
arrange(desc(num_of_words))
talks_the_most
## # A tibble: 700 x 2
## speaker num_of_words
## <chr> <int>
## 1 Rachel Green 97633
## 2 Ross Geller 95561
## 3 Chandler Bing 86547
## 4 Joey Tribbiani 86426
## 5 Scene Directions 85114
## 6 Monica Geller 82988
## 7 Phoebe Buffay 81506
## 8 Mike Hannigan 3234
## 9 Janice Litman Goralnik 2595
## 10 Richard Burke 2579
## # ... with 690 more rows
Here’s how we made our data tidy:
female_main_characters <- c("Phoebe Buffay", "Rachel Green", "Monica Geller")
male_main_characters <- c("Chandler Bing", "Ross Geller", "Joey Tribbiani")
main_characters <- c(female_main_characters, male_main_characters)
#Top 15 minor male characters according to words spoken for the whole series
male_minor_characters <- c("Paul Stevens",
"David",
"Tag Jones",
"Gary",
"Frank Buffay Jr.",
"Gunther",
"Eddie Menuek",
"Richard Burke",
"Doug",
"Eric",
"Leonard Green",
"Mike",
"Peter Becker",
"Joshua Burgin",
"Jack Geller")
#Top 15 minor female characters according to words spoken for the whole series
female_minor_characters <- c("Sandra Green",
"Emily Waltham",
"Charlie Wheeler",
"Susan Bunch",
"Judy Geller",
"Carol Willick",
"Janice Litman Goralnik",
"Phoebe Abbott",
"Mona",
"Jill Green",
"Kathy",
"Janine Lecroix",
"Amy Green",
"Joanna",
"Erica")
male_characters <- c(male_minor_characters, male_main_characters)
female_characters <- c(female_main_characters, female_minor_characters)
characters <- c(female_characters, male_characters)
## a tidy dataset for friends series
tidy_friends <- friends %>%
filter(speaker %in% characters) %>% #choose only main and top minor characters
mutate(num_of_words = str_count(text, "\\S+")) %>% #count the words spoken in each sentence
mutate(gender = ifelse(speaker %in% female_characters, "F", "M")) %>% #specify the gender of each character
select(speaker, season, episode, scene, num_of_words, gender)
tidy_friends
## # A tibble: 54,558 x 6
## speaker season episode scene num_of_words gender
## <chr> <int> <int> <int> <int> <chr>
## 1 Monica Geller 1 1 1 11 F
## 2 Joey Tribbiani 1 1 1 14 M
## 3 Chandler Bing 1 1 1 16 M
## 4 Phoebe Buffay 1 1 1 5 F
## 5 Phoebe Buffay 1 1 1 16 F
## 6 Monica Geller 1 1 1 21 F
## 7 Chandler Bing 1 1 1 6 M
## 8 Chandler Bing 1 1 1 22 M
## 9 Chandler Bing 1 1 1 11 M
## 10 Joey Tribbiani 1 1 1 2 M
## # ... with 54,548 more rows
further tidying will be done in future tests in the project.
words_per_season <- tidy_friends %>%
filter(speaker %in% characters) %>%
mutate(season = factor(season)) %>%
group_by(speaker, season) %>%
summarise(num_of_words = sum(num_of_words)) %>%
mutate(gender = ifelse(speaker %in% female_characters, "F", "M"))
words_per_season %>%
ggplot(aes(x = season, y = num_of_words, fill = gender)) +
geom_col() +
labs(title = "How many words each gender spoke per season") +
theme(plot.title = element_text(hjust = 0.5)) +
ylab("Number of words") +
xlab("Season") +
coord_flip()
This plot shows us multiple things:
males and females share almost the same percentage of words spoken each season.
at seasons 6 and 9 the characters just talked! a lot !!!.
they calmed down at season 10 and scored the least amount of words spoken per season.
talks_the_most %>%
filter(speaker %in% characters) %>%
ggplot(aes(x = speaker, y =num_of_words)) +
geom_col(aes(fill=speaker)) +
coord_flip() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
labs(y = "Number of words", title = "Who spoke the most?")
we can see that Rachel and Ross won the most talkative characters in the show. well, maybe because of all the “we were on a break” thing
## words spoken per episodes by each character
words_per_episode <- tidy_friends %>%
group_by(speaker, season, episode) %>%
summarise(num_of_words = sum(num_of_words))
words_per_episode %>%
filter(speaker %in% main_characters) %>%
ggplot(aes(x = speaker, y = num_of_words)) +
geom_boxplot(aes(fill=speaker)) +
coord_flip() +
labs(title = "distribution of words spoken per episode", y = "Number of words", x = "", subtitle = "among main characters") +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
although we have the data about all of the population, we can’t categorise each character to male or female since we have a huge number of speakers and the dataset doesn’t specify their gender, so we don’t know their mean or variance. That’s why we’re going to conduct a t-test for our sample.
Let’s define some variables:
\(\mu_{f}\): mean of words spoken by
female characters per episode
\(\mu_{m}\): mean of words spoken by
male characters per episode
our hypothesis is:
\(H_0: \mu_{f} - \mu_{m} = 0\)
\(H_1: \mu_{f} - \mu_{m} > 0\)
Our Assumptions:
* Variance of number of words spoken by males and females are unequal
(that doesn’t really change anything in the t-test as we checked).
* number of words are normally distributed:
let’s check on that.
words_per_episode %>%
ggplot(aes(num_of_words)) +
geom_density()
Since we can’t say that they’re normally distributed and they’re limited
by zero as minimum, we can ignore that since we have a large enough
sample size.
# t-test for two unpaired samples
words_per_episode_fm <- words_per_episode %>%
mutate(gender = ifelse(speaker %in% female_characters, "F", "M"))
t.test(formula = words_per_episode_fm$num_of_words ~ words_per_episode_fm$gender, alternative = "greater", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: words_per_episode_fm$num_of_words by words_per_episode_fm$gender
## t = -0.26866, df = 1671.8, p-value = 0.6059
## alternative hypothesis: true difference in means between group F and group M is greater than 0
## 95 percent confidence interval:
## -16.11855 Inf
## sample estimates:
## mean in group F mean in group M
## 335.9652 338.2272
we can see that we got a very high p-value = 0.6059 and it’s higher than \(\alpha = 0.05\). so we can’t reject the Null Hypothesis and we can, in 95% confidence say that the mean of words spoken by males and females in FRIENDS series per episode are equal.
We will build a model that expresses the relationship (if there is) between the character’s gender and words they speak per episode.
well, maybe not that kind of model
words_per_episode %>%
mutate(gender = ifelse(speaker %in% female_characters, "F", "M")) %>%
lm(formula = num_of_words ~ gender) %>%
summary()
##
## Call:
## lm(formula = num_of_words ~ gender, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -337.23 -114.47 -4.97 107.90 975.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 335.965 5.976 56.222 <2e-16 ***
## genderM 2.262 8.423 0.269 0.788
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 172.6 on 1677 degrees of freedom
## Multiple R-squared: 4.3e-05, Adjusted R-squared: -0.0005533
## F-statistic: 0.07211 on 1 and 1677 DF, p-value: 0.7883
we notice that we can’t significantly decide that there’s a linear relation between the number of words spoken by each character per episode and the gender of the character (p-value = 0.788).
Let’s check the first dataset friends_info in the
friends package and see if there’s a relationship between
the IMDB rating and views in USA.
at first let’s have a visual indication:
friends_info %>%
ggplot(mapping = aes(x = us_views_millions, y = imdb_rating)) +
geom_point() +
theme_minimal()
There are 3 outliers, we’ll remove them to have better indication.
friends_info <- friends_info %>%
filter(us_views_millions < 50)
friends_info %>%
ggplot(mapping = aes(x = us_views_millions, y = imdb_rating)) +
geom_point() +
theme_minimal()
friends_info_lm <- friends_info %>%
lm(formula = imdb_rating ~ us_views_millions)
friends_info_lm %>%
summary()
##
## Call:
## lm(formula = imdb_rating ~ us_views_millions, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14531 -0.24731 -0.02954 0.21320 1.16609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.691313 0.158578 48.502 < 2e-16 ***
## us_views_millions 0.030419 0.006295 4.832 2.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3669 on 230 degrees of freedom
## Multiple R-squared: 0.09217, Adjusted R-squared: 0.08822
## F-statistic: 23.35 on 1 and 230 DF, p-value: 2.467e-06
This regression model indicates that there’s a linear relationship
between the IMDB rating of each episode and the views in USA. the views
predict (significantly, according to their p-value < 0.05 and
f-statistic p-value < 0.05) the rating it will get in IMDB.
R-squared is 0.09217 which means that the variance of IMDB rating can
be, by only 9.217%, explained by the variance of the views in USA.
Let’s express that relationship visually:
friends_info %>%
ggplot(mapping = aes(x = us_views_millions, y = imdb_rating)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm")
our model indicates that:
\(rating = 7.691313 + 0.030419 \cdot views + \epsilon\) which means every 1 million views can higher the rating for the episode by 0.030419.
although, one of the base assumption in linear regression is
that
\(\epsilon \sim \mathcal{N}(0,
\sigma_\epsilon)\)
we’ll check if our assumption is right.
qqnorm(friends_info_lm$residuals)
qqline(friends_info_lm$residuals)
another assumption is homoscedacity.
let’s also check that:
plot(friends_info_lm, which = c(1, 1))
we can see in the qqplot, the residuals are almost falling on the
straight line, and in the in the Residuals vs Fitted values (in which
the red line defines the variance of the residuals) that this line is
almost straight. We can’t say that the residuals \(\epsilon\) are perfectly normally
distributed (qqplot), and their variance are constant (Rsiduals vs
Fitted).
But as said in the lectures we can ignore that and move on, and as Joey
answered to Monica’s question “how can you not care?”:
like this
one thing to be aware of is that the rating for each episode is limited.
This section is just for fun.
Every fan of F.R.I.E.N.D.S memorizes all of the catch phrases told by
each one of the characters, let’s have fun with them.
friends %>%
mutate(on_a_break = str_detect(text, "on a break")) %>%
select(speaker, on_a_break) %>%
filter(on_a_break == TRUE) %>%
count(speaker) %>%
ggplot(aes(x = speaker, y = n)) +
geom_col(aes(fill=speaker)) +
coord_flip() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
labs(title = "Who said `we were on a break` the most?", x = "")
Well, Ross’s the one who made the mistake it’s only logical he neede to defend himself.
Which catcphrase was said the most?
let’s see
friends %>%
mutate(on_a_break = str_detect(text, "on a break")) %>%
mutate(smelly_cat = str_detect(text, "smelly cat")) %>%
mutate(how_you_doin = str_detect(text, "how you doin")) %>%
select(on_a_break:how_you_doin) %>%
filter(smelly_cat == TRUE |
how_you_doin == TRUE |
on_a_break == TRUE) %>%
colSums()
## on_a_break smelly_cat how_you_doin
## 15 6 14